Grouping and summarizing

Summarizing the median life expectancy

You’ve seen how to find the mean life expectancy and the total population across a set of observations, but mean() and sum() are only two of the functions R provides for summarizing a collection of numbers. Here, you’ll learn to use the median() function in combination with summarize().

By the way, dplyr displays some messages when it’s loaded that we’ve been hiding so far. They’ll show up in red and start with:

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

This will occur in future exercises each time you load dplyr: the message lists built-in functions, such as filter() and lag() from the stats package, that dplyr masks with its own versions. You won’t need to worry about this message within this course.
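As a minimal sketch of what masking means in practice: the snippet below loads dplyr quietly and then calls both versions of filter(). Wrapping library() in suppressPackageStartupMessages() is one common way to hide the startup messages; it is an assumption that this matches how they were hidden in earlier exercises. The toy data frame is hypothetical.

```r
# Load dplyr without printing its startup messages
suppressPackageStartupMessages(library(dplyr))

# Hypothetical toy data, used only for illustration
toy <- data.frame(x = 1:5)

# After loading dplyr, a bare filter() call uses dplyr's version,
# which keeps the rows that match a condition
kept <- filter(toy, x > 3)

# The masked version is still available through the stats:: prefix;
# it applies a linear filter (here a 3-point moving average) to a series
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
```

Nothing is lost when a function is masked: the package prefix always reaches the original.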

# Load the knitr and kableExtra packages
library(knitr)
library(kableExtra)
options(knitr.table.format = "html")
# Load the gapminder package
library(gapminder)
# Load the dplyr package
library(dplyr)
# Load the ggplot2 package as well
library(ggplot2)
theme_set(theme_bw())  # pre-set the bw theme.
# Summarize to find the median life expectancy
gapminder %>%
  summarize(medianLifeExp = median(lifeExp)) %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center", font_size = 11) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#3f7689")
medianLifeExp
60.7125
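To see why the median can tell a different story from the mean, here is a small sketch on a hypothetical vector of values (not taken from gapminder). A single unusually large value pulls the mean upward but leaves the median unchanged.

```r
# Hypothetical toy values, not taken from gapminder
life_exp <- c(40, 45, 50, 55, 90)

# The mean is pulled upward by the single large value
toy_mean <- mean(life_exp)      # 56

# The median is the middle value once the numbers are sorted
toy_median <- median(life_exp)  # 50

# With an even number of values, median() averages the two middle ones
even_median <- median(c(1, 2, 3, 4))  # 2.5
```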

Summarizing the median life expectancy in 1957

Rather than summarizing the entire dataset, you may want to find the median life expectancy for only one particular year. In this case, you’ll find the median in the year 1957.

# Filter for 1957 then summarize the median life expectancy
gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp)) %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center", font_size = 11) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#3f7689")
medianLifeExp
48.3605

Summarizing multiple variables in 1957

The summarize() verb allows you to summarize multiple variables at once. In this case, you’ll use the median() function to find the median life expectancy and the max() function to find the maximum GDP per capita.

# Filter for 1957 then summarize the median life expectancy and the maximum GDP per capita
gapminder %>%
   filter(year == 1957) %>%
   summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap)) %>% 
   kable() %>%
   kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center", font_size = 11) %>%
   row_spec(0, bold = TRUE, color = "white", background = "#3f7689")
medianLifeExp maxGdpPercap
48.3605 113523.1
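The same pattern can be checked by hand on a tiny data frame. The sketch below uses hypothetical toy data with gapminder-style column names and adds a third summary, n(), which counts the rows that went into the result; nRows is an illustrative column name, not one used elsewhere in these exercises.

```r
library(dplyr)

# Hypothetical toy data standing in for gapminder
toy <- data.frame(
  year      = c(1957, 1957, 1957, 1962),
  lifeExp   = c(40, 50, 70, 60),
  gdpPercap = c(1000, 5000, 3000, 9000)
)

# Each name = expression pair inside summarize() becomes one output column
result <- toy %>%
  filter(year == 1957) %>%
  summarize(
    medianLifeExp = median(lifeExp),  # middle of 40, 50, 70
    maxGdpPercap  = max(gdpPercap),   # largest of 1000, 5000, 3000
    nRows         = n()               # number of rows summarized
  )
```

The result has a single row with three columns: 50, 5000 and 3.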

Summarizing by year

Now, you’ll perform those two summaries within each year in the dataset, using the group_by() verb.

# Find median life expectancy and maximum GDP per capita in each year
gapminder %>%
   group_by(year) %>%
   summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap)) %>% 
   kable() %>%
   kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center", font_size = 11) %>%
   row_spec(0, bold = TRUE, color = "white", background = "#3f7689")
year medianLifeExp maxGdpPercap
1952 45.1355 108382.35
1957 48.3605 113523.13
1962 50.8810 95458.11
1967 53.8250 80894.88
1972 56.5300 109347.87
1977 59.6720 59265.48
1982 62.4415 33693.18
1987 65.8340 31540.97
1992 67.7030 34932.92
1997 69.3940 41283.16
2002 70.8255 44683.98
2007 71.9355 49357.19

Interesting: notice that median life expectancy across countries is generally going up over time, but maximum GDP per capita is not.
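One reason the two columns can move differently is that the median reflects the typical country, while the maximum is set by a single observation. A minimal sketch on hypothetical toy data shows this, and uses slice_max() (available in dplyr 1.0.0 and later, which is assumed here) to reveal which row produced each maximum.

```r
library(dplyr)

# Hypothetical toy data: three made-up countries observed in two years
toy <- data.frame(
  country   = c("A", "B", "C", "A", "B", "C"),
  year      = c(1952, 1952, 1952, 1957, 1957, 1957),
  lifeExp   = c(30, 40, 60, 35, 45, 65),
  gdpPercap = c(500, 9000, 2000, 600, 7000, 2500)
)

# One summary row per year
by_year_toy <- toy %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap))

# The single row with the highest gdpPercap within each year
top_rows <- toy %>%
  group_by(year) %>%
  slice_max(gdpPercap, n = 1)
```

In this toy data the median life expectancy rises from 40 to 45 while the maximum GDP per capita falls from 9000 to 7000, and both maxima come from country B alone.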

Summarizing by continent

You can group by any variable in your dataset to create a summary. Rather than comparing across time, you might be interested in comparing among continents. You’ll want to do that within one year of the dataset: let’s use 1957.

# Find median life expectancy and maximum GDP per capita in each continent in 1957
gapminder %>%
   filter(year == 1957) %>%
   group_by(continent) %>%
   summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap)) %>% 
   kable() %>%
   kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center", font_size = 11) %>%
   row_spec(0, bold = TRUE, color = "white", background = "#3f7689")
continent medianLifeExp maxGdpPercap
Africa 40.5925 5487.104
Americas 56.0740 14847.127
Asia 48.2840 113523.133
Europe 67.6500 17909.490
Oceania 70.2950 12247.395

Summarizing by continent and year

Instead of grouping just by year, or just by continent, you’ll now group by both continent and year to summarize within each.

# Find median life expectancy and maximum GDP per capita in each continent/year combination
gapminder %>%
   group_by(continent, year) %>%
   summarize(medianLifeExp = median(lifeExp), maxGdpPercap = max(gdpPercap)) %>% 
   kable() %>%
   kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE, position = "center", font_size = 11) %>%
   row_spec(0, bold = TRUE, color = "white", background = "#3f7689") %>% 
   scroll_box(width = "100%", height = "300px")
continent year medianLifeExp maxGdpPercap
Africa 1952 38.8330 4725.296
Africa 1957 40.5925 5487.104
Africa 1962 42.6305 6757.031
Africa 1967 44.6985 18772.752
Africa 1972 47.0315 21011.497
Africa 1977 49.2725 21951.212
Africa 1982 50.7560 17364.275
Africa 1987 51.6395 11864.408
Africa 1992 52.4290 13522.158
Africa 1997 52.7590 14722.842
Africa 2002 51.2355 12521.714
Africa 2007 52.9265 13206.485
Americas 1952 54.7450 13990.482
Americas 1957 56.0740 14847.127
Americas 1962 58.2990 16173.146
Americas 1967 60.5230 19530.366
Americas 1972 63.4410 21806.036
Americas 1977 66.3530 24072.632
Americas 1982 67.4050 25009.559
Americas 1987 69.4980 29884.350
Americas 1992 69.8620 32003.932
Americas 1997 72.1460 35767.433
Americas 2002 72.0470 39097.100
Americas 2007 72.8990 42951.653
Asia 1952 44.8690 108382.353
Asia 1957 48.2840 113523.133
Asia 1962 49.3250 95458.112
Asia 1967 53.6550 80894.883
Asia 1972 56.9500 109347.867
Asia 1977 60.7650 59265.477
Asia 1982 63.7390 33693.175
Asia 1987 66.2950 28118.430
Asia 1992 68.6900 34932.920
Asia 1997 70.2650 40300.620
Asia 2002 71.0280 36023.105
Asia 2007 72.3960 47306.990
Europe 1952 65.9000 14734.233
Europe 1957 67.6500 17909.490
Europe 1962 69.5250 20431.093
Europe 1967 70.6100 22966.144
Europe 1972 70.8850 27195.113
Europe 1977 72.3350 26982.291
Europe 1982 73.4900 28397.715
Europe 1987 74.8150 31540.975
Europe 1992 75.4510 33965.661
Europe 1997 76.1160 41283.164
Europe 2002 77.5365 44683.975
Europe 2007 78.6085 49357.190
Oceania 1952 69.2550 10556.576
Oceania 1957 70.2950 12247.395
Oceania 1962 71.0850 13175.678
Oceania 1967 71.3100 14526.125
Oceania 1972 71.9100 16788.629
Oceania 1977 72.8550 18334.198
Oceania 1982 74.2900 19477.009
Oceania 1987 75.3200 21888.889
Oceania 1992 76.9450 23424.767
Oceania 1997 78.1900 26997.937
Oceania 2002 79.7400 30687.755
Oceania 2007 80.7195 34435.367
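When you group by two variables, summarize() by default removes only the last grouping level, so the result above is still grouped by continent. The sketch below, on hypothetical toy data and assuming dplyr 1.0.0 or later (where the .groups argument exists), shows how to inspect the remaining grouping and how to drop it.

```r
library(dplyr)

# Hypothetical toy data: two continents, two years, two rows each
toy <- data.frame(
  continent = rep(c("Asia", "Europe"), each = 4),
  year      = rep(c(1952, 1952, 1957, 1957), times = 2),
  lifeExp   = c(40, 50, 45, 55, 60, 70, 65, 75)
)

# Default behaviour: the last grouping variable (year) is dropped,
# so the result stays grouped by continent
still_grouped <- toy %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp))

remaining <- group_vars(still_grouped)

# .groups = "drop" returns an ungrouped result instead
ungrouped <- toy %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp), .groups = "drop")
```

Either version has one row per continent and year combination; the difference only matters if you keep piping the result into further verbs such as mutate() or another summarize().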

Visualizing summarized data

Visualizing median life expectancy over time

In the last chapter, you summarized the gapminder data to calculate the median life expectancy within each year. The code below saves that summary as the by_year dataset.

Now you can use the ggplot2 package to turn this into a visualization of changing life expectancy over time.

by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))
# Create a scatter plot showing the change in medianLifeExp over time
ggplot(by_year, aes(x = year, y = medianLifeExp)) +
   geom_point() +
   expand_limits(y = 0) +
   labs(subtitle="Life expectancy over time", 
        y="Life expectancy", 
        x="Year", 
        title="Scatterplot", 
        caption = "")

It looks like median life expectancy across countries is increasing over time.
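The expand_limits(y = 0) layer in the plot above forces the y-axis to include zero, so the increase is shown relative to the full scale rather than exaggerated by an axis that starts near the smallest value. As a rough sketch of its effect, the code below builds the same kind of plot with and without it on hypothetical toy data and reads back the range each y scale was trained on using layer_scales().

```r
library(ggplot2)

# Hypothetical toy summary with the same column names as by_year
toy_by_year <- data.frame(
  year          = c(1952, 1957, 1962),
  medianLifeExp = c(45, 48, 51)
)

# Without expand_limits(), the y scale covers only the data
p_default <- ggplot(toy_by_year, aes(x = year, y = medianLifeExp)) +
  geom_point()

# With expand_limits(y = 0), zero is added to the y scale
p_zero <- p_default + expand_limits(y = 0)

# Range of values each y scale was trained on (before axis padding)
range_default <- layer_scales(p_default)$y$range$range
range_zero    <- layer_scales(p_zero)$y$range$range
```

Whether to anchor an axis at zero is a design choice: it suits quantities like life expectancy where zero is meaningful, but it can flatten trends you want to inspect closely.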

Visualizing median GDP per capita per continent over time

In the last exercise you were able to see how the median life expectancy of countries changed over time. Now you’ll examine the median GDP per capita instead, and see how the trend differs among continents.

# Summarize medianGdpPercap within each continent within each year:
by_year_continent <- gapminder %>%
   group_by(continent, year) %>%
   summarize(medianGdpPercap = median(gdpPercap))

# Plot the change in medianGdpPercap in each continent over time
ggplot(by_year_continent, aes(x = year, y = medianGdpPercap, color = continent)) +
   geom_point() +
   expand_limits(y = 0) +
   labs(subtitle="Median GDP per capita over time by continent", 
        y="GDP per capita", 
        x="Year", 
        title="Scatterplot", 
        caption = "")

Comparing median life expectancy and median GDP per capita per continent in 2007

In these exercises you’ve generally created plots that show change over time. But as another way of exploring your data visually, you can also use ggplot2 to plot summarized data to compare continents within a single year.

# Summarize the median GDP and median life expectancy per continent in 2007
by_continent_2007 <- gapminder %>% 
   filter(year == 2007) %>% 
   group_by(continent) %>% 
   summarize(medianLifeExp = median(lifeExp), medianGdpPercap = median(gdpPercap))

# Use a scatter plot to compare the median GDP and median life expectancy
ggplot(by_continent_2007, aes(x = medianGdpPercap, y = medianLifeExp, color = continent)) +
   geom_point() +
   expand_limits(y = 0) +
   labs(subtitle="Median life expectancy with median GDP per continent in 2007", 
        y="Median Life expectancy", 
        x="Median GDP per capita", 
        title="Scatterplot", 
        caption = "")